To illustrate the clipping method, let's look at an example.
We have a dataset named nyc_airbnb.csv, which contains per-night rental prices for Airbnb listings. The price column contains some outliers. Our task is to find these outliers and handle them by winsorizing (clipping).
First, we load our dataset, "New York Housing", into a dataframe and view it.
The steps are:
- Import the pandas library.
- Read the dataset into the variable nyc using the read_csv method in pandas.
- Display nyc.

import pandas as pd
nyc=pd.read_csv("../datasets/nyc_airbnb.csv")
nyc
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 36484665 | Charming one bedroom - newly renovated rowhouse | 8232441 | Sabrina | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | NaN | NaN | 2 | 9 |
| 48891 | 36485057 | Affordable room in Bushwick/East Williamsburg | 6570630 | Marisol | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | NaN | NaN | 2 | 36 |
| 48892 | 36485431 | Sunny Studio at Historical Neighborhood | 23492952 | Ilgar & Aysel | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | NaN | NaN | 1 | 27 |
| 48893 | 36485609 | 43rd St. Time Square-cozy single bed | 30985759 | Taz | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | NaN | NaN | 6 | 2 |
| 48894 | 36487245 | Trendy duplex in the very heart of Hell's Kitchen | 68119814 | Christophe | Manhattan | Hell's Kitchen | 40.76404 | -73.98933 | Private room | 90 | 7 | 0 | NaN | NaN | 1 | 23 |
48895 rows × 16 columns
Plot the price data
The steps are:
- Import the plotly.express library as px.
- Using px, call the strip() method to generate the strip plot, with the arguments:
  - nyc: the variable where the data is stored
  - price: the column data to plot on the y axis
- Save the plot in a variable named price_strip.
- Display price_strip using the show() method.

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
price_strip = px.strip(nyc, y='price')
price_strip.show()
The steps are:
- Call the boxplot() method to generate the boxplot, with the arguments:
  - column: the column data to plot for the boxplot
  - figsize (optional): the size of the figure in terms of width and height
  - fontsize (optional): the size of the text in the figure
  - vert (optional): the orientation of the plot. False means horizontal (x axis) alignment, True means vertical alignment
- Apply the boxplot() method to the variable nyc, where our data is stored.
- Save the result in a variable named box_price.
- Call the box_price variable to display the boxplot.

box_price = nyc.boxplot(column='price', figsize=(15,5), fontsize=10, vert=False)
box_price
<AxesSubplot:>
The price distribution in terms of numbers:
The steps are:
- Select the price column from the variable nyc.
- Call the describe() method on the price data. This shows the distribution of the price data, including the five-number summary.

nyc['price'].describe()
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
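The summary already hints at outliers: the maximum (10000) is far above the 75th percentile (175). As a minimal sketch, the same comparison can be read from the describe() result in code. Here `toy_prices` is a hypothetical, made-up stand-in for `nyc['price']`, so the numbers differ from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for nyc['price']; the values are made up.
toy_prices = pd.Series([10, 20, 30, 40, 50, 1000], name="price")

# describe() returns a Series indexed by the statistic name.
summary = toy_prices.describe()
q3_value = summary["75%"]
max_value = summary["max"]

# A maximum many times larger than the 75th percentile
# suggests a long right tail with possible outliers.
ratio = max_value / q3_value
print(f"75%: {q3_value}, max: {max_value}, ratio: {ratio:.1f}")
```

A large ratio is only a hint, not a rule; the IQR-based bounds give the actual cutoffs.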
The steps are:
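- Select the price column from the variable nyc.
- Call the quantile() method on the price data with 0.75 to get the 75th percentile.
- Save the result in a variable named q3.
- Print q3.

q3 = nyc['price'].quantile(0.75)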
print("q3:",q3)
q3: 175.0
The steps are:
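- Select the price column from the variable nyc.
- Call the quantile() method on the price data with 0.25 to get the 25th percentile.
- Save the result in a variable named q1.
- Print q1.

# find the 25th percentile value
q1 = nyc['price'].quantile(0.25)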
print("q1:",q1)
q1: 69.0
The steps are:
- Subtract q1 from q3 to get the interquartile range.
- Save the result in a variable named iqr.
- Print iqr.

iqr = q3 - q1
print("iqr:",iqr)
iqr: 106.0
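As a side note, quantile() also accepts a list of quantiles, so both quartiles can be fetched in one call. The sketch below uses a hypothetical, made-up Series (`toy_prices`) in place of `nyc['price']`.

```python
import pandas as pd

# Hypothetical stand-in for nyc['price']; the values are made up.
toy_prices = pd.Series([10, 20, 30, 40, 50, 1000])

# Passing a list returns a Series with one value per requested quantile,
# which can be unpacked directly into q1 and q3.
q1, q3 = toy_prices.quantile([0.25, 0.75])
iqr = q3 - q1

print("q1:", q1, "q3:", q3, "iqr:", iqr)
```

pandas interpolates linearly between data points by default, which is why the quartiles here fall between observed values.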
The steps are:
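For the upper bound:
- Add 1.5 times iqr to q3.
- Save the result in a variable named upper_bound.
- Print upper_bound.

For the lower bound:
- Subtract 1.5 times iqr from q1.
- Save the result in a variable named lower_bound.
- Print lower_bound.

upper_bound = q3 + 1.5*iqr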
print("upper bound",upper_bound)
lower_bound= q1 - 1.5*iqr
print("lower bound",lower_bound)
upper bound 334.0
lower bound -90.0
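Since part of the task is to find the outliers, it can help to count how many values fall outside these bounds. The following is a minimal sketch under stated assumptions: `iqr_bounds` is a hypothetical helper name, and `toy_prices` is made-up data standing in for `nyc['price']`.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds placed k * IQR beyond the quartiles.

    Hypothetical helper that mirrors the q1 / q3 / iqr steps in the text.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical stand-in for nyc['price']; the values are made up.
toy_prices = pd.Series([10, 20, 30, 40, 50, 1000])

lower, upper = iqr_bounds(toy_prices)

# Boolean mask that is True for values outside the bounds.
is_outlier = (toy_prices < lower) | (toy_prices > upper)
n_outliers = int(is_outlier.sum())

print("lower:", lower, "upper:", upper, "outliers:", n_outliers)
```

On the real data the lower bound is negative (-90.0), so with non-negative prices only the upper bound can flag outliers.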
The steps are:
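For lower_point:
- Take the maximum of the two values:
  - lower_bound: calculated in the previous step
  - nyc['price'].min(): the minimum value in the nyc['price'] data
- Save the result in a variable named lower_point.
- Print lower_point.

For upper_point:
- Take the minimum of the two values:
  - upper_bound: calculated in the previous step
  - nyc['price'].max(): the maximum value in the nyc['price'] data
- Save the result in a variable named upper_point.
- Print upper_point.

lower_point = max(lower_bound, nyc['price'].min())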
print("lower_point", lower_point)
upper_point= min(upper_bound,nyc['price'].max())
print("upper_point", upper_point)
lower_point 0
upper_point 334.0
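The max/min step keeps the reported limits inside the range of the observed data (for example 0 instead of -90). It does not change which values end up being clipped. A minimal sketch on hypothetical, made-up data (`toy_prices`) illustrates this:

```python
import pandas as pd

# Hypothetical stand-in for nyc['price']; the values are made up.
toy_prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])

q1 = toy_prices.quantile(0.25)
q3 = toy_prices.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Restrict the bounds to the range that actually occurs in the data.
lower_point = max(lower_bound, toy_prices.min())
upper_point = min(upper_bound, toy_prices.max())

# Clipping with the raw bounds or with the points gives the same values.
clipped_with_bounds = toy_prices.clip(lower_bound, upper_bound)
clipped_with_points = toy_prices.clip(lower_point, upper_point)
same_result = bool((clipped_with_bounds == clipped_with_points).all())

print("lower_point:", lower_point, "upper_point:", upper_point)
print("same result:", same_result)
```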
The steps are:
- Select the price column from the nyc dataframe.
- Call the clip() method on the price column, with the arguments:
  - lower_point: the lower point of the price data
  - upper_point: the upper point of the price data
- Assign the result back to the price column of the nyc dataframe to make the changes permanent.

nyc['price'] = nyc['price'].clip(lower_point, upper_point)
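To make the effect of clip() concrete, here is a small sketch on made-up values. The limits 50.0 and 334.0 and the Series `toy_prices` are assumptions chosen for illustration, not values derived from the dataset.

```python
import pandas as pd

# Hypothetical prices; the values and the limits below are made up.
toy_prices = pd.Series([0.0, 60.0, 100.0, 180.0, 5000.0])

# Values below the lower limit are raised to it, and values above
# the upper limit are lowered to it. Everything else is unchanged.
clipped = toy_prices.clip(lower=50.0, upper=334.0)

print(clipped.tolist())
print("rows before:", len(toy_prices), "rows after:", len(clipped))
```

Note that clipping caps extreme values instead of dropping rows, so the number of observations stays the same.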
The steps are:
- Select the price column from the variable nyc.
- Call the describe() method on the price data.

nyc['price'].describe()
count    48895.000000
mean       132.979753
std         83.530504
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max        334.000000
Name: price, dtype: float64
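Comparing this summary with the earlier one, the quartiles and median are unchanged while the mean and standard deviation drop. The sketch below reproduces that pattern on hypothetical, made-up data; `toy_prices` and the upper limit of 85.0 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for nyc['price']; the values are made up.
toy_prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])

# Cap only the upper tail at an assumed limit of 85.0.
clipped = toy_prices.clip(upper=85.0)

# Collect a few statistics before and after clipping side by side.
comparison = pd.DataFrame({
    "before": toy_prices.agg(["mean", "std", "median"]),
    "after": clipped.agg(["mean", "std", "median"]),
})
print(comparison)
```

The median is robust to the extreme value, so it stays the same, whereas the mean and standard deviation are pulled down once the extreme value is capped.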
The steps are:
- Using px, call the strip() method to generate the strip plot, with the arguments:
  - nyc: the variable where the data is stored
  - price: the column data to plot on the y axis
- Save the plot in a variable named price_strip2.
- Display price_strip2 using the show() method.

price_strip2 = px.strip(nyc, y='price')
price_strip2.show()
The steps are:
- Call the boxplot() method to generate the boxplot, with the arguments:
  - column: the column data to plot for the boxplot
  - figsize (optional): the size of the figure in terms of width and height
  - fontsize (optional): the size of the text in the figure
  - vert (optional): the orientation of the plot. False means horizontal (x axis) alignment, True means vertical alignment
- Apply the boxplot() method to the variable nyc, where our data is stored.
- Save the result in a variable named box_price2.
- Call the box_price2 variable to display the boxplot.

box_price2 = nyc.boxplot(column='price', figsize=(10,5), fontsize=8, vert=False)
box_price2
<AxesSubplot:>
By using the clip method, we have capped the outliers in the price data at the lower and upper points, so no extreme values remain. Using this dataset should now give more reliable predictions of Airbnb rental prices.
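The example above derives its limits from the IQR rule. Winsorizing is also often done with fixed percentiles. The sketch below shows that variant as an alternative, not as the method used in this example; the 5th and 95th percentile limits and the Series `toy_prices` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: the values 1.0 through 100.0.
toy_prices = pd.Series([float(v) for v in range(1, 101)])

# Assumed limits: the 5th and 95th percentiles of the data.
lower_limit = toy_prices.quantile(0.05)
upper_limit = toy_prices.quantile(0.95)

# Cap everything outside the percentile limits.
winsorized = toy_prices.clip(lower=lower_limit, upper=upper_limit)

print("limits:", lower_limit, upper_limit)
print("min/max after:", winsorized.min(), winsorized.max())
```

Percentile limits always cap a fixed share of the data, whereas the IQR rule only caps values that lie far from the quartiles.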